Abstract 

We investigate the problem of mining numerical 
data with Formal Concept Analysis. The usual way 
is to use a scaling procedure -transforming numer- 
ical attributes into binary ones- leading either to 
a loss of information or of efficiency, in particu- 
lar w.r.t. the volume of extracted patterns. By con- 
trast, we propose to directly work on numerical data 
in a more precise and efficient way. For that, the 
notions of closed patterns, generators and 
equivalence classes are revisited in the numerical context. 
Moreover, two algorithms are proposed and tested 
in an evaluation involving real-world data, showing 
the quality of the present approach. 

1 Introduction 

In this paper, we investigate the problem of mining numerical 
data. This problem arises in many practical situations, e.g. 
analysis of gene and transcriptomic data in biology, soil char- 
acteristics and land occupation in agronomy, demographic 
data in economics, temperatures in climate analysis, etc. We 
introduce an original framework for mining numerical data 
based on advances in itemset mining and in Formal Concept 
Analysis (FCA) [Ganter and Wille, 1999], respectively 
condensed representations of itemsets and pattern structures in 
FCA [Ganter and Kuznetsov, 2001]. The mining of frequent 
itemsets in binary data, considering a set of objects and a set 
of associated attributes or items, has been studied for a long 
time and usually involves the so-called "pattern flooding" problem 
[Bastide et al., 2000]. A way of dealing with pattern flooding 
is to search for equivalence classes of itemsets, i.e. itemsets 
shared by the same set of objects (or having the same image). 
For an equivalence class, there is one maximal itemset, which 
corresponds to a "closed set", and possibly several minimal 
elements corresponding to "generators" (or "key itemsets"). 
From these elements, families of association rules can be 
extracted. These itemsets are also related to FCA, where a 
concept lattice is built from a binary context and where formal 
concepts are closed sets of objects and attributes. 

The present work is rooted both in FCA and pattern min- 
ing with the objective of extracting interval patterns from nu- 
merical data. Our approach is based on "pattern structures" 
where complex descriptions can be associated with objects. 



In [Kaytoue et al., 2011], we introduced closed interval 
patterns in the context of gene expression data mining. 
Intuitively, an interval pattern is a vector of intervals, each 
dimension corresponding to a range of values of a given attribute; 
it is closed when composed of the smallest intervals 
characterizing the same set of objects. 

In the present paper, we complete and extend this first at- 
tempt. Considering numerical data, some general character- 
istics of equivalence classes remain, e.g. one maximal ele- 
ment which is a closed pattern and possibly several genera- 
tors which are minimal patterns w.r.t. a subsumption relation 
defined on patterns. We show that directly extracting patterns 
from numerical data is more efficient using pattern structures 
than working on binary data with associated scaling 
procedures. We also provide a semantics to interval patterns in the 
Euclidean space, and design and experiment with algorithms to 
extract frequent closed interval patterns and their generators. 

The problem of mining patterns in numerical data is usually 
referred to as quantitative itemset/association rule mining 
[Srikant and Agrawal, 1996]. Generally, an appropriate 
discretization splits attribute ranges into intervals maximizing 
some interest functions, e.g. support, confidence. However, 
none of these works covers the notions of equivalence 
classes, closed patterns, and generators, and this is one of the 
original contributions of the present paper. 

The plan of the paper is as follows. Firstly, we introduce 
the problem of mining numerical data and interval patterns. 
Then, we recall the basics of FCA and interordinal scaling. 
We pose a number of questions that we propose to answer 
using our framework of interval pattern structures dealing 
with numerical data. We then detail two original algorithms 
for extracting closed interval patterns and their generators. 
These algorithms are evaluated in the last section on 
real-world data. Finally, we end the paper by discussing related 
work and giving perspectives to the present research work. 
As a complement, an extended version of this paper is given 
in [Kaytoue et al., 2010], providing pseudo-code for the 
algorithms and a longer discussion on the usefulness of interval 
patterns in classification problems and privacy-preserving 
data mining. 

2 Problem definition 

We propose a definition of interval patterns for numerical 
data. Intuitively, each object of a numerical dataset 
corresponds to a vector of numbers, where each dimension stands 
for an attribute. Accordingly, an interval pattern is a vector of 
intervals, and each dimension describes the range of a 
numerical attribute. We only consider finite intervals, and the set 
of attributes/dimensions is assumed to be (canonically) 
ordered. 

Numerical dataset. A numerical dataset is given by a set of 
objects $G$, a set of numerical attributes $M$, where the range 
of $m \in M$ is a finite set denoted $W_m$. $m(g) = w$ means that 
$w$ is the value of attribute $m$ for object $g$. 

Table 1: A numerical dataset. 

Interval pattern and support. In a numerical dataset, an 
interval pattern is a vector of intervals 
$d = \langle [a_i, b_i] \rangle_{i \in \{1,\dots,|M|\}}$ 
where $a_i, b_i \in W_{m_i}$, and each dimension corresponds to an 
attribute following a canonical order on vector dimensions, 
and $|M|$ denotes the number of attributes. An object $g$ is 
in the image of an interval pattern 
$\langle [a_i, b_i] \rangle_{i \in \{1,\dots,|M|\}}$ when 
$m_i(g) \in [a_i, b_i]$, $\forall i \in \{1,\dots,|M|\}$. The support 
of $d$, denoted by $sup(d)$, is the cardinality of the image of $d$. 

Running example. Table 1 is a numerical dataset with objects 
in $G = \{g_1, \dots, g_5\}$ and attributes in $M = \{m_1, m_2, m_3\}$. 
The range of $m_1$ is $W_{m_1} = \{4, 5, 6\}$, and we have $m_1(g_1) = 5$. 
Here, we do not consider either missing values or multiple 
values for an attribute. $\langle [5,6], [7,8], [4,6] \rangle$ is an 
interval pattern in Table 1 with image $\{g_1, g_2, g_5\}$ and support 3. 

Interval pattern search space. Given a set of attributes 
$M = \{m_i\}_{i \in \{1,\dots,|M|\}}$, the search space of interval 
patterns is the set $D$ of all interval vectors 
$\langle [a_i, b_i] \rangle_{i \in \{1,\dots,|M|\}}$, with 
$a_i, b_i \in W_{m_i}$. The size of the search space is given by 

$$|D| = \prod_{i \in \{1,\dots,|M|\}} \frac{|W_{m_i}| \times (|W_{m_i}| + 1)}{2}$$ 

Example. All possible intervals for $m_1$ are in 
{[4,4], [5,5], [6,6], [4,5], [5,6], [4,6]}. Considering also 
attributes $m_2$ and $m_3$, we have 6 x 6 x 10 = 360 patterns. 
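
The count in this example can be checked with a few lines of Python. This is only an illustrative sketch: the function names are hypothetical, and the ranges of the second and third attributes ({7, 8, 9} and {4, 5, 6, 8}) are assumptions inferred from the values quoted in the running example, since only the range of the first attribute is stated explicitly.

```python
from math import prod


def interval_count(n_values: int) -> int:
    """Number of intervals [a, b] with a <= b over n distinct values."""
    return n_values * (n_values + 1) // 2


def search_space_size(ranges) -> int:
    """Product of the per-attribute interval counts."""
    return prod(interval_count(len(set(r))) for r in ranges)


# Assumed attribute ranges (only the first one is given in the text).
ranges = [{4, 5, 6}, {7, 8, 9}, {4, 5, 6, 8}]
print(search_space_size(ranges))  # 6 * 6 * 10 = 360
```
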

The classical problem of "pattern flooding" in data-mining 
is even worse for numerical data. Indeed, with three attributes, 
there are only $2^3 = 8$ possible itemsets, compared to the 360 
interval patterns in the above example with the same number of 
attributes. A solution widely investigated in itemset-mining 
for minimizing the effect of pattern flooding relies on 
condensed representations including closed itemsets and 
generators [Bastide et al., 2000]. By contrast, the analysis of 
numerical datasets can be considered within the formal concept 
analysis framework (FCA) [Ganter and Wille, 1999], which 
is closely related to itemset-mining [Stumme et al., 2002]. 
Accordingly, we are interested in adapting the notions of 
(frequent) closed itemsets and their generators to interval patterns 
within the FCA framework, and in providing an appropriate 
semantics to these patterns. 



3 Interval patterns in FCA 

[Ganter and Wille, 1999] define a discretization procedure, 
called interordinal scaling, transforming numerical data into 
binary data that encodes any interval of values from a 
numerical dataset. We recall here the basics of FCA and interordinal 
scaling. 

3.1 Formal concept analysis 

FCA starts with a formal context $(G, N, I)$ where $G$ 
denotes a set of objects, $N$ a set of attributes, or items, and 
$I \subseteq G \times N$ a binary relation between $G$ and $N$. The 
statement $(g, n) \in I$, or $gIn$, means: "the object $g$ has 
attribute $n$". Two operators $(\cdot)'$ define a Galois connection 
between the powersets $(\mathcal{P}(G), \subseteq)$ and 
$(\mathcal{P}(N), \subseteq)$, with $A \subseteq G$ 
and $B \subseteq N$: $A' = \{n \in N \mid \forall g \in A : gIn\}$ and 
$B' = \{g \in G \mid \forall n \in B : gIn\}$. A pair $(A, B)$, such that 
$A' = B$ and $B' = A$, is called a (formal) concept, while $A$ is 
called the extent and $B$ the intent of the concept $(A, B)$. 

From an itemset-mining point of view, concept intents 
correspond to closed itemsets, since $(\cdot)''$ is a closure operator. 
An equivalence class is a set of itemsets with the same closure 
(and same image). For any subset $B \subseteq N$, $B''$ is the largest 
itemset w.r.t. set inclusion in its equivalence class. Dually, 
generators are the smallest itemsets w.r.t. set inclusion in an 
equivalence class. Precisely, $B \subseteq N$ is closed iff 
$\nexists C$ such that $B \subset C$ with $C' = B'$; $B \subseteq N$ is 
a generator iff $\nexists C \subset B$ with $C' = B'$. 

3.2 Interordinal scaling 

Given a numerical attribute $m$ with range $W_m$, Interordinal 
Scaling builds a binary table with $2 \times |W_m|$ binary attributes. 
They are denoted by "$m \leq w$" and "$m \geq w$", $\forall w \in W_m$, 
and called IS-items. An object $g$ has an IS-item "$m \leq w$" (resp. 
"$m \geq w$") iff $m(g) \leq w$ (resp. $m(g) \geq w$). Applying this 
scaling to our example gives Table 2. It is possible to apply 
classical mining algorithms to process this table for extracting 
itemsets composed of IS-items. These itemsets are called 
IS-itemsets in the following. 
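
As a rough illustration of the scaling just described, the following Python sketch builds the IS-items of a single attribute. The function name and the string encoding of IS-items are hypothetical, and the attribute values are an assumed stand-in for the first column of Table 1 (only the value 5 for the first object is stated in the text).

```python
def interordinal_scale(values_by_object, attr_name):
    """Return the IS-items of one attribute and, per object, the IS-items it has."""
    domain = sorted(set(values_by_object.values()))
    items = [f"{attr_name}<={w}" for w in domain] + [f"{attr_name}>={w}" for w in domain]
    context = {}
    for obj, value in values_by_object.items():
        lower = {f"{attr_name}<={w}" for w in domain if value <= w}
        upper = {f"{attr_name}>={w}" for w in domain if value >= w}
        context[obj] = lower | upper
    return items, context


# Assumed values of the first attribute for the five objects.
m1 = {"g1": 5, "g2": 6, "g3": 4, "g4": 4, "g5": 5}
items, context = interordinal_scale(m1, "m1")
print(len(items))             # 2 * |W_m| = 6 binary attributes
print(sorted(context["g1"]))  # IS-items held by g1
```
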

IS-itemsets can be turned into interval patterns, since an 
IS-item gives a constraint on the range $W_m$ of an attribute 
$m$. For example, the IS-itemset 
$\{m_1 \leq 5, m_1 \leq 6, m_1 \geq 4, m_2 \leq 9, m_2 \geq 7\}$ 
corresponds to the interval pattern 
$\langle [4,5], [7,9], [4,8] \rangle$. We have here the interval [4,8] 
for attribute $m_3$; [4,8] covers the whole range of $m_3$ since no 
constraint is given for $m_3$. 
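
The conversion from an IS-itemset to an interval pattern can be sketched as follows in Python. The string encoding of IS-items and the function name are hypothetical, and the attribute ranges are assumptions consistent with the running example. Each attribute starts with its whole range, and every IS-item only tightens one bound.

```python
import re


def itemset_to_pattern(itemset, ranges):
    """Intersect the constraints of all IS-items, attribute by attribute."""
    pattern = {m: [min(ws), max(ws)] for m, ws in ranges.items()}
    for item in itemset:
        attr, op, w = re.fullmatch(r"(\w+)(<=|>=)(\d+)", item).groups()
        w = int(w)
        if op == "<=":
            pattern[attr][1] = min(pattern[attr][1], w)  # tighten right bound
        else:
            pattern[attr][0] = max(pattern[attr][0], w)  # tighten left bound
    return {m: tuple(bounds) for m, bounds in pattern.items()}


# Assumed attribute ranges for the running example.
ranges = {"m1": {4, 5, 6}, "m2": {7, 8, 9}, "m3": {4, 5, 6, 8}}
itemset = {"m1<=5", "m1<=6", "m1>=4", "m2<=9", "m2>=7"}
print(itemset_to_pattern(itemset, ranges))
```

Dropping an implied item such as "m1<=6" from the itemset yields the same interval pattern, which is why several IS-itemsets may map to one pattern.
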

Therefore, mining interval patterns can be considered with 
a scaling of numerical data. However, this scaling produces 
a very large number of binary attributes compared to the 
original ones. Hence, when the original data are very large, 
processing the resulting formal context becomes computationally 
hard. Accordingly, this raises the following questions: 

(i) Can we avoid scaling and directly work on numerical 
data instead of searching for IS-itemsets? (ii) Can we adapt 
the notions of condensed representations such as closed pat- 
terns and generators for numerical data, and efficiently com- 
pute those patterns? (iii) What would be the semantics that 
could be provided to closed patterns and generators? 
Table 2: Interordinally scaled context encoding the numerical dataset from Table 1. 



4 Revisiting numerical pattern mining 

In this section, we answer those questions. First, we show that 
a closure operator can be defined for interval patterns based 
on their image. Then, we provide interval patterns with an 
appropriate semantics for defining the notion of equivalence 
classes of patterns, closed and generator patterns. After 
using this semantics to discuss why working with interordinal 
scaling is not acceptable, we propose 
two efficient algorithms for mining closed interval patterns 
and generators. Experiments follow in Section 5. 

4.1 A closure operator for interval patterns 

We introduce the formalism of pattern structures 
[Ganter and Kuznetsov, 2001], an extension of formal 
contexts for dealing with complex data in FCA. It defines 
a closure operator for a partially ordered set of object 
descriptions called patterns. 

Formally, let $G$ be a set of objects, $(D, \sqcap)$ be a 
semi-lattice of object descriptions, and $\delta : G \to D$ be a mapping: 
$(G, (D, \sqcap), \delta)$ is called a pattern structure. Elements of 
$D$ are called patterns, and are ordered as follows: 
$c \sqcap d = c \iff c \sqsubseteq d$. Intuitively, objects in $G$ have 
descriptions in $(D, \sqcap)$. For example, $g_1$ in Table 1 has 
description $\langle [5,5], [7,7], [6,6] \rangle$, 
where $D$ is the set of all possible interval patterns ordered 
with $\sqcap$, as made precise below. Consider the two operators 
$(\cdot)^{\square}$ defined as follows, with $A \subseteq G$ and 
$d \in (D, \sqcap)$: 

$$d^{\square} = \{g \in G \mid d \sqsubseteq \delta(g)\} \qquad A^{\square} = \bigsqcap_{g \in A} \delta(g)$$ 

These operators form a Galois connection between 
$(\mathcal{P}(G), \subseteq)$ and $(D, \sqsubseteq)$. 
$(\cdot)^{\square\square}$ is a closure operator, 
meaning that any pattern $d$ such that $d = d^{\square\square}$ is closed. 

Interval pattern structures. This general closure operator 
can be used for interval patterns. Indeed, interval patterns 
can be ordered within a meet-semi-lattice when the infimum 
is defined as follows. Let 
$c = \langle [a_i, b_i] \rangle_{i \in \{1,\dots,|M|\}}$ and 
$d = \langle [e_i, f_i] \rangle_{i \in \{1,\dots,|M|\}}$ be two interval 
patterns. Their infimum is given by 
$c \sqcap d = \langle [\min(a_i, e_i), \max(b_i, f_i)] \rangle_{i \in \{1,\dots,|M|\}}$. 
The ordering relation induced by this definition is: 
$c \sqsubseteq d \iff [e_i, f_i] \subseteq [a_i, b_i], \forall i \in \{1,\dots,|M|\}$. 

Consider now a numerical dataset, e.g. Table 1. $(D, \sqsubseteq)$ is 
the finite ordered set of all interval patterns. $\delta(g) \in D$ is the 
pattern associated with an object $g \in G$. Then: 

$\langle [5,6], [7,8], [4,8] \rangle^{\square} = \{g \in G \mid \langle [5,6], [7,8], [4,8] \rangle \sqsubseteq \delta(g)\} = \{g_1, g_2, g_5\}$ 

$\{g_1, g_2, g_5\}^{\square} = \delta(g_1) \sqcap \delta(g_2) \sqcap \delta(g_5) = \langle [5,6], [7,8], [4,6] \rangle$ 

This means that $\langle [5,6], [7,8], [4,8] \rangle$ is not a closed 
interval pattern, its closure being $\langle [5,6], [7,8], [4,6] \rangle$. 
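
The two derivation operators and the resulting closure can be sketched in a few lines of Python. The dataset below is an assumption: it is a stand-in for Table 1 chosen to agree with every value quoted in the running example, and all function names are hypothetical.

```python
from functools import reduce

# Assumed stand-in for Table 1: object -> (m1, m2, m3).
DATA = {"g1": (5, 7, 6), "g2": (6, 8, 4), "g3": (4, 8, 5),
        "g4": (4, 9, 8), "g5": (5, 8, 5)}


def meet(c, d):
    """Infimum of two interval patterns: dimension-wise convex hull."""
    return tuple((min(a, e), max(b, f)) for (a, b), (e, f) in zip(c, d))


def delta(g):
    """Description of an object: a vector of degenerate intervals."""
    return tuple((v, v) for v in DATA[g])


def image(d):
    """Objects whose values fall inside every interval of d."""
    return {g for g, row in DATA.items()
            if all(a <= v <= b for v, (a, b) in zip(row, d))}


def description(objects):
    """Smallest pattern covering a non-empty set of objects."""
    return reduce(meet, (delta(g) for g in objects))


def closure(d):
    """Closure of a pattern with a non-empty image."""
    return description(image(d))


d = ((5, 6), (7, 8), (4, 8))
print(sorted(image(d)))  # ['g1', 'g2', 'g5']
print(closure(d))        # ((5, 6), (7, 8), (4, 6))
```

The sketch does not handle patterns with an empty image, for which `reduce` has nothing to combine.
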

4.2 Semantics 

An interval pattern $d$ is a $|M|$-dimensional vector of intervals 
and can be represented by a hyper-rectangle (or rectangle for 
short) in the Euclidean space $\mathbb{R}^{|M|}$, whose sides are 
parallel to the coordinate axes. This geometrical representation 
provides a semantics for interval patterns. In formal terms, an 
interpretation is given by 
$\mathcal{I} = (\mathbb{R}^{|M|}, (\cdot)^{\mathcal{I}})$ where 
$\mathbb{R}^{|M|}$ is the interpretation domain, and 
$(\cdot)^{\mathcal{I}} : D \to \mathbb{R}^{|M|}$ the interpretation 
function. Figure 1 gives four interval pattern representations 
in $\mathbb{R}^2$, with only attributes $m_1$ and $m_3$ of our example. 
The image of $d_1$ is given by all objects $g$ whose description 
$\delta(g)$ is included in the rectangle associated with $d_1$, i.e. 
the set $\{g_1, g_3, g_4, g_5\}$. We can interpret the closure operator 
$(\cdot)^{\square\square}$ according to this semantics. The first 
operator $(\cdot)^{\square}$ applies to a rectangle and returns the set 
of objects whose description is included in this rectangle. The 
second operator $(\cdot)^{\square}$ applies to a set of objects and 
returns the smallest rectangle that contains their descriptions, 
i.e. the convex hull of their corresponding descriptions. 




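[Figure 1 shows four interval patterns over attributes $m_1$ and $m_3$: 
$d_1 = \langle [4,5], [5,8] \rangle$ with $d_1^{\square} = \{g_1, g_3, g_4, g_5\}$; 
$d_2 = \langle [4,5], [4,5] \rangle$ with $d_2^{\square} = \{g_3, g_5\}$; 
$d_3 = \langle [5,6], [4,4] \rangle$ with $d_3^{\square} = \{g_2\}$; 
$d_4 = \langle [6,6], [4,8] \rangle$ with $d_4^{\square} = \{g_2\}$.] 

Figure 1: Interval patterns in the Euclidean space. 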

4.3 Closed interval patterns and generators 

Now, we can revisit the notion of equivalence classes of 
itemsets as introduced in [Bastide et al., 2000]: an equivalence 
class of interval patterns is a set of rectangles containing the 
same object descriptions (based on all rectangles in the search 
space as given in Section 2). This allows us to define the 
notions of (frequent) closed interval patterns ((F)CIP) and 
(frequent) interval pattern generators ((F)IPG), adapted from 
itemsets. 

Equivalence class. Two interval patterns $c$ and $d$ with the same 
image are equivalent, i.e. $c^{\square} = d^{\square}$, and we write 
$c \cong d$. $\cong$ is an equivalence relation, i.e. reflexive, 
transitive and symmetric. The set of patterns equivalent to a 
pattern $d$ is denoted by $[d] = \{c \mid c \cong d\}$ and called the 
equivalence class of $d$. 



Closed interval pattern (CIP). A pattern $d$ is closed if there 
does not exist any pattern $e$ such that $d \sqsubset e$ with $d \cong e$. 

Interval pattern generator (IPG). A pattern $d$ is a generator 
if there does not exist a pattern $e$ such that $e \sqsubset d$ with 
$d \cong e$. 

Frequent interval pattern. A pattern $d$ is frequent if its 
image has a higher cardinality than a given minimal support 
threshold minSup. 

We illustrate these definitions with two-dimensional 
interval patterns, and their representation in Figure 1, i.e. 
considering attributes $m_1$ and $m_3$ only. 
$\langle [4,5], [6,8] \rangle \cong \langle [4,6], [6,8] \rangle$ 
with image $\{g_1, g_4\}$. $\langle [4,6], [6,8] \rangle$ is not closed 
as $\langle [4,6], [6,8] \rangle \sqsubseteq \langle [4,5], [6,8] \rangle$, 
these two patterns having the same image, i.e. $\{g_1, g_4\}$. 
$\langle [4,5], [5,8] \rangle$ is closed. 
$\langle [4,6], [5,8] \rangle$ and $\langle [4,5], [4,8] \rangle$ are 
generators in the class of the closed interval pattern 
$d_1 = \langle [4,5], [5,8] \rangle$ with image 
$\{g_1, g_3, g_4, g_5\}$. Among the four patterns in Figure 1, $d_1$ is 
the only frequent interval pattern with minSup = 3. 

Based on the above semantics, an equivalence class is a set 
of rectangles containing the same set of object descriptions, 
with a (unique) closed pattern corresponding to the smallest 
rectangle, and one or several generator(s) corresponding to 
the largest rectangle(s). 
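
This geometric reading can be checked by brute force on a small example. The Python sketch below enumerates every rectangle over two attributes, groups rectangles by image, and picks the smallest one (closed pattern) and the maximal ones (generators) in a class. The point coordinates are an assumed stand-in for the first and third columns of Table 1, and the function names are hypothetical.

```python
from collections import defaultdict
from itertools import combinations_with_replacement, product

# Assumed stand-in for attributes m1 and m3 of Table 1.
POINTS = {"g1": (5, 6), "g2": (6, 4), "g3": (4, 5), "g4": (4, 8), "g5": (5, 5)}


def intervals(values):
    """All intervals [a, b] with a <= b and bounds among the given values."""
    return list(combinations_with_replacement(sorted(set(values)), 2))


def image(d):
    return frozenset(g for g, p in POINTS.items()
                     if all(a <= v <= b for v, (a, b) in zip(p, d)))


def contains(big, small):
    """True when rectangle `big` contains rectangle `small` in every dimension."""
    return all(a <= e and f <= b for (a, b), (e, f) in zip(big, small))


def equivalence_classes():
    """Group every rectangle of the search space by its image."""
    classes = defaultdict(list)
    domains = [intervals(column) for column in zip(*POINTS.values())]
    for d in product(*domains):
        classes[image(d)].append(d)
    return classes


def closed_and_generators(patterns):
    """Smallest rectangle of a class, and its maximal rectangles."""
    closed = [d for d in patterns if all(contains(e, d) for e in patterns)]
    gens = [d for d in patterns
            if not any(e != d and contains(e, d) for e in patterns)]
    return closed, gens


classes = equivalence_classes()
target = frozenset({"g1", "g3", "g4", "g5"})
closed, gens = closed_and_generators(classes[target])
print(closed)  # the unique closed pattern of the class
print(gens)    # its generators
```
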

These definitions are counter-intuitive w.r.t. itemsets: the 
smallest rectangles subsume the largest ones. This is due to 
the definition of infimum as set intersection for itemsets while 
this is convex hull for intervals, which behaves dually as a 
supremum. 

4.4 IS-itemsets versus interval patterns 

Interordinal scaling makes it possible to build binary data 
encoding all intervals of values from a numerical dataset. 
Therefore, one may attempt to mine closed itemsets and generators in 
these data with existing data-mining algorithms. Here we show 
why this should be avoided. 

Local redundancy of IS-itemsets. Extracting all 
IS-itemsets in our example (from Table 2) gives 31,487 
IS-itemsets. This is surprising since there are at most 360 
possible interval patterns. In fact, many IS-itemsets are locally 
redundant. For example, $\{m_1 \leq 5\}$ and 
$\{m_1 \leq 5, m_1 \leq 6\}$ both correspond to the interval pattern 
$\langle [4,5], [7,9], [4,8] \rangle$: the 
constraint $m_1 \leq 6$ is redundant w.r.t. $m_1 \leq 5$ on the set of 
values $W_{m_1}$. Hence there is no 1-1 correspondence between 
IS-itemsets and interval patterns. It can be shown that there 
is a 1-1 correspondence only between closed IS-itemsets and 
CIP [Kaytoue et al., 2011]. Later we show that the local 
redundancy of IS-itemsets makes the computation of closed sets 
very hard. 

Global redundancy of IS-itemset generators. Since 
IS-itemset generators are the smallest itemsets, they do not 
suffer from local redundancy. However, we can observe 
another kind of redundancy, called global redundancy: it may 
happen that two different and incomparable IS-itemset 
generators correspond to two different interval patterns, 
one subsuming the other. In Table 2, both IS-itemsets 
$N_1 = \{m_1 \leq 4, m_3 \leq 5\}$ and $N_2 = \{m_1 \leq 4, m_3 \leq 6\}$ 
have the same image $\{g_3\}$ and are generators, i.e. there does 
not exist a smaller itemset of these itemsets with the same image. 
However, their corresponding interval patterns are respectively 
$c = \langle [4,4], [7,9], [4,5] \rangle$ and 
$d = \langle [4,4], [7,9], [4,6] \rangle$, and we 
have $d \sqsubseteq c$, while $c^{\square} = d^{\square}$; hence $c$ is 
not an interval pattern generator. 

4.5 Algorithms 

We detail a depth-first enumeration of interval patterns, 
starting with the most frequent one. Based on this enumeration, 
we design the algorithms MinIntChange and MinIntChangeG 
for extracting respectively frequent closed interval patterns 
(FCIP) and frequent interval pattern generators (FIPG). 

Interval pattern enumeration. Consider first one numerical 
attribute of the example, say $m_1$. The semi-lattice of 
intervals $(D_{m_1}, \sqsubseteq)$ is composed of all possible intervals 
with bounds in $W_{m_1}$ and is ordered by the relation $\sqsubseteq$. 
The unique smallest element w.r.t. $\sqsubseteq$ is the interval with 
maximal size, i.e. $[4,6] = [\min(W_{m_1}), \max(W_{m_1})]$, and maximal 
frequency (here 5). The basic idea of pattern generation lies 
in minimal changes for generating the direct subsumers of a 
given pattern. For example, two minimal changes can be 
applied to [4,6]. The first consists in replacing the right bound 
with the value of $W_{m_1}$ immediately lower than 6, i.e. 5, 
generating the interval [4,5]. The second consists in repeating 
the same operation for the left bound, generating the interval 
[5,6]. Repeating these two operations allows one to enumerate 
all elements of $(D_{m_1}, \sqsubseteq)$. A right minimal change is 
formally defined as follows: given $a, b, v \in W_m$ with $a \neq b$, 
$mcr([a,b]) = [a,v]$ with $v < b$ and $\nexists w \in W_m$ s.t. 
$v < w < b$; a left minimal change $mcl([a,b])$ is defined dually. 
Minimal changes give direct next subsumers and imply a 
monotonicity property of frequency, i.e. the support of $[a,v]$ is less 
than or equal to the support of $[a,b]$. To avoid generating the 
same pattern several times, a lectic order on changes, or 
equivalently on patterns, is defined. After a right change, one can 
apply either a right or a left change; after a left change, one 
can apply only a left change. Figure 2 shows the depth-first 
traversal (numbered arrows) of the diagram of 
$(D_{m_1}, \sqsubseteq)$. Backtracking occurs when an interval of the 
form $[w,w]$ is reached ($w \in W_{m_1}$), or when no more change can 
be applied. 

Each minimal change can be interpreted in terms of an IS-item. 
For example, if $[a,b]$ corresponds to the IS-itemset 
$\{m \geq a, m \leq b\}$, then $mcr([a,b]) = [a,v]$ corresponds to 
$\{m \geq a, m \leq b, m \leq v\}$, i.e. adding $m \leq v$ to the 
original IS-itemset. The same applies dually to left minimal 
changes. These IS-items characterizing minimal changes are drawn 
in Figure 2. This figure accordingly represents a prefix-tree, 
factoring out the effort to process common prefixes or minimal 
changes, and avoiding the redundancy problems inherent in 
interordinal scaling. 

The generalization to several attributes is straightforward. A 
lectic order is classically defined on numerical attributes as a 
lexicographic order, e.g. $m_1 < m_2 < m_3$. Then changes are 
applied as explained above for all attributes respecting this 
order, e.g. after applying a change to attribute $m_2$, one cannot 
apply a change to attribute $m_1$. 
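
For a single attribute, the enumeration by minimal changes can be sketched as a short recursive function in Python. The function name is hypothetical, and the visiting order shown here (right changes before left changes) is one possible depth-first order; it is not claimed to reproduce the arrow numbering of Figure 2.

```python
def enumerate_intervals(values):
    """List every interval over `values` exactly once using minimal changes."""
    values = sorted(set(values))
    visited = []

    def visit(lo, hi, left_only):
        visited.append((values[lo], values[hi]))
        if lo == hi:                  # interval [w, w]: backtrack
            return
        if not left_only:
            visit(lo, hi - 1, False)  # right minimal change
        visit(lo + 1, hi, True)       # left minimal change, then only left changes

    visit(0, len(values) - 1, False)
    return visited


print(enumerate_intervals([4, 5, 6]))
# [(4, 6), (4, 5), (4, 4), (5, 5), (5, 6), (6, 6)]
```

Forbidding right changes after a left change is what guarantees uniqueness: each interval is reached only by the path that applies all its right changes first.
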

Extracting FCIP with MinIntChange. The pattern 
enumeration starts with the minimal pattern w.r.t. $\sqsubseteq$ and 
generates its direct subsumers with lower or equal support. The 
problem is that minimal changes do not necessarily 
generate patterns with strictly smaller support. Therefore, we 
should apply changes until a pattern with different support is 
generated to identify a frequent closed interval pattern (FCIP), 
but this would not be efficient. We adopt the idea of the algorithm 
CloseByOne [Kuznetsov and Obiedkov, 2002]: before applying a 
minimal change, the closure operator $(\cdot)^{\square\square}$ is 
applied to the current pattern, which allows all equivalent patterns 
to be skipped. Indeed, the minimal pattern $d$ w.r.t. $\sqsubseteq$ is 
closed as it is given by $d = G^{\square}$. Applying a minimal change 
returns a pattern $c$ with strictly smaller support, since 
$d \sqsubset c$ and $d$ is closed. If $c$ is frequent, we can continue, 
apply the closure operator and the next changes in lectic order, 
which allows us to completely enumerate all FCIP. Since a FCIP may 
have several different associated generators, it can be generated 
several times. Still following the idea of CloseByOne, a 
canonicity test can be defined according to the lectic order on 
minimal changes. 

Figure 2: Depth-first traversal of $(D_{m_1}, \sqsubseteq)$. 

Consider a pattern $d$ generated by a change at attribute 
$m_j \in M$. Its closure is given by $d^{\square\square}$. If 
$d^{\square\square}$ differs from $d$ for some attribute $m_h \in M$ 
such that $m_h < m_j$, then $d^{\square\square}$ has already been 
generated: it is not canonically generated, hence the algorithm 
backtracks. 

Example. We start from the minimal pattern 
$d = \langle [4,6], [7,9], [4,8] \rangle$. The first minimal change in 
lectic order is a right change on attribute $m_1$. We obtain pattern 
$c = \langle [4,5], [7,9], [4,8] \rangle$, and obviously 
$d \sqsubseteq c$. However, 
$c^{\square\square} = \langle [4,5], [7,9], [5,8] \rangle$, hence $c$ is 
not closed. $c^{\square\square}$ is stored as FCIP and the next changes 
will be applied to it. 

Now consider the pattern obtained by a minimal change on the 
left border for attribute $m_3$, i.e. 
$e = \langle [4,6], [7,9], [5,8] \rangle$. 
We have $e^{\square\square} = \langle [4,5], [7,9], [5,8] \rangle$. 
$e$ and $e^{\square\square}$ differ for attribute $m_1$, but $e$ has 
been generated from a change on $m_3$. Since $m_1 < m_3$, 
$e^{\square\square}$ is not canonical and has already been generated 
(previous example), hence the algorithm backtracks. 
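
The enumeration with closure and canonicity test can be sketched in Python as follows. This is a simplified reading of the description above, not the authors' implementation: the function names are hypothetical, the dataset is an assumed stand-in for Table 1 consistent with the running example, and a pattern is treated as frequent when its support is at least `min_sup`.

```python
# Assumed stand-in for Table 1: one tuple (m1, m2, m3) per object.
DATA = [(5, 7, 6), (6, 8, 4), (4, 8, 5), (4, 9, 8), (5, 8, 5)]


def image(data, pattern):
    """Rows lying inside the rectangle described by `pattern`."""
    return [row for row in data
            if all(lo <= v <= hi for v, (lo, hi) in zip(row, pattern))]


def closure_of(rows):
    """Smallest interval pattern covering a non-empty list of rows."""
    return tuple((min(col), max(col)) for col in zip(*rows))


def min_int_change(data, min_sup=1):
    """Return (closed pattern, support) pairs with support >= min_sup."""
    n_attrs = len(data[0])
    domains = [sorted(set(col)) for col in zip(*data)]
    results = []

    def explore(pattern, support, attr, left_only):
        results.append((pattern, support))
        for j in range(attr, n_attrs):       # lectic order on attributes
            lo, hi = pattern[j]
            if lo == hi:
                continue
            dom = domains[j]
            changes = []
            if not (j == attr and left_only):
                changes.append(((lo, dom[dom.index(hi) - 1]), False))  # right change
            changes.append(((dom[dom.index(lo) + 1], hi), True))       # left change
            for interval, is_left in changes:
                candidate = pattern[:j] + (interval,) + pattern[j + 1:]
                rows = image(data, candidate)
                if len(rows) < max(min_sup, 1):
                    continue                  # infrequent (or empty image)
                closed = closure_of(rows)
                if closed[:j] != candidate[:j]:
                    continue                  # not canonical: backtrack
                explore(closed, len(rows), j, is_left)

    explore(closure_of(data), len(data), 0, False)
    return results


for pattern, support in min_int_change(DATA, min_sup=3):
    print(pattern, support)
```

On this small dataset, the result can be compared with the closures of all non-empty subsets of objects, which is a convenient sanity check for the canonicity test.
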

Extracting FIPG with MinIntChangeG. We 
now adapt MinIntChange to extract FIPG, following a 
well-known principle in itemset-mining algorithms 
[Calders and Goethals, 2005]. For any FCIP $d$, a 
minimal change implies that the support of the resulting 
pattern $c$ is strictly smaller than the support of $d$. Therefore, 
$c$ is a good generator candidate for the next FCIP. Accordingly, 
at each step of the depth-first enumeration a FIPG candidate 
$c$ is generated from the previous one $b$, by applying a minimal 
change characterized by $b^{\square\square}$. Then, each candidate $c$ 
has to be checked to determine whether it is a generator or not. We 
know that the candidate has no subsumers in its branch with the 
same support. However, there could exist a branch with another 
FIPG $e$ with the same image and resulting from fewer changes. 
Considering the lectic order on minimal changes, we use a reverse 
traversal of the tree (see Figure 2: 7, 8, 9, 10, 1, 4, 5, 2, 3, 6), 
as already suggested in the binary case in 
[Calders and Goethals, 2005]. 
Since generators correspond to the largest rectangles, i.e. those 
on which the fewest minimal changes have been applied, if $c$ is 
not a generator, a generator $e$ associated with its equivalence 
class has already been generated, and $c$ is discarded. To 
check the existence of $e$, we look it up in an auxiliary data 
structure storing the already extracted FIPG. Precisely, if the 
data structure contains a FIPG $e$ with the same support as 
candidate $c$, such that $e \sqsubseteq c$, $c$ is discarded, and the 
algorithm backtracks. Otherwise $c$ is declared a FIPG and stored. 
We have experimented with the MinIntChangeG algorithm using 
two well-known and adapted data structures, a trie and a 
hashtable. 
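
The lookup step of this check can be sketched with a hashtable keyed by support, here a Python dictionary. The class and method names are hypothetical, and the sketch covers only the auxiliary structure, not the traversal itself. It relies on the assumption, taken from the description above, that larger rectangles are offered before the smaller ones they contain. Indexing by support is enough because $e \sqsubseteq c$ implies that the image of $c$ is included in the image of $e$, so equal supports give equal images.

```python
from collections import defaultdict


def subsumes(e, c):
    """e ⊑ c: every interval of c is included in the matching interval of e."""
    return all(a <= x and y <= b for (a, b), (x, y) in zip(e, c))


class GeneratorStore:
    """Hashtable of extracted generators, indexed by support."""

    def __init__(self):
        self._by_support = defaultdict(list)

    def offer(self, candidate, support):
        """Store the candidate and return True, unless a stored pattern
        with the same support subsumes it (then return False)."""
        bucket = self._by_support[support]
        if any(subsumes(e, candidate) for e in bucket):
            return False
        bucket.append(candidate)
        return True


store = GeneratorStore()
print(store.offer(((4, 6), (5, 8)), 4))  # True: nothing stored yet
print(store.offer(((4, 5), (5, 8)), 4))  # False: contained in the first rectangle
print(store.offer(((4, 5), (4, 8)), 4))  # True: incomparable with the first one
```
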

5 Experiments 

We evaluate the performance of the algorithms implemented 
in Java, namely MinIntChange, MinIntChangeG-h with an 
auxiliary hashtable and MinIntChangeG-t with an auxiliary 
trie. Recalling that closed IS-itemsets and CIP are in 1-1 
correspondence, we compare the performance for mining 
interordinally scaled data with the closed-itemset-mining 
algorithm LCMv2 [Uno et al., 2004]. For studying the 
global redundancy effect of IS-itemset generators, we use 
the generator-mining algorithm GrGrowth [Liu et al., 2006]. 
Both implementations in C++ are available from the authors. 
All experiments are conducted on a 2.50 GHz machine with 
16 GB RAM running under Linux 2.6.18-92.el5. We choose 
datasets from the Bilkent repository¹, namely Bolts (BL), 
Basketball (BK) and Airport (AP), AP being the worst case, where 
each attribute value is different. 

First experiments compare MinIntChange for extracting 
FCIP and LCMv2 for extracting equivalent frequent closed 
IS-itemsets in Table 3. Second experiments consist in 
extracting frequent interval pattern generators (FIPG) with 
MinIntChangeG-h and MinIntChangeG-t. We also extract 
frequent itemset generators (FISG) in the corresponding binary data 
after interordinal scaling with GrGrowth for studying the 
global redundancy effect in Table 4. 



Dataset  minSupp  MinIntChange      LCMv2      |FCIP| 
BL       80%              < 50       < 50       1,130 
         50%               252        100      32,107 
         25%             1,215      1,060     171,192 
         10%             1,821      1,950     268,975 
         1               1,905      2,090     272,223 
AP       80%             4,595      1,470     346,741 
         50%           143,939    149,580  16,214,345 
         25%           413,805    899,180  58,373,631 
         10%           506,985  6,810,125  80,504,566 
         1             517,548  6,813,591  82,467,124 

Table 3: Execution time for extracting FCIP (in ms). 



In both cases, using binary data is better when the 
minimal support is high (e.g. 90%). For low supports, a critical 
issue, our algorithms deliver better execution times. Most 
importantly, the global redundancy effect rules out the use of 
binary data, e.g. only 1.6% of all FISG are actually FIPG in 
dataset BL. Finally, the algorithm MinIntChangeG-t 
outperforms MinIntChangeG-h. MinIntChangeG-t however needs 
more memory since it stores each closed set of objects as a 
word in the trie, and to each word the list of associated FIPG. 

¹ http://funapp.cs.bilkent.edu.tr/ 

Dataset  minSupp  GrGrowth     MinIntChangeG-h  MinIntChangeG-t     |FIPG|      |FISG|  |FIPG|/|FISG|     |FCIP|  |FIPG|/|FCIP| 
BL       90%          < 50                < 50             < 50        176         194            90%        112           1.57 
         80%          < 50                < 50             < 50      1,952       2,823            69%      1,130           1.73 
         50%           150               1,212              529     66,350     222,088            29%     32,107              2 
         25%         3,432              27,988            3,893    411,442   3,559,419            11%    171,192            2.4 
         1         123,564             438,214           24,141  1,165,824  69,646,301           1.6%    272,223            4.3 
BK       90%          < 50               1,268            1,207     67,737      75,058            84%     48,847            1.3 
         85%         4,565              26,154           12,139    554,956     799,574            69%    403,562           1.37 
         80%   Intractable             512,126          107,700  2,730,812          NA             NA  1,938,984           1.40 

Table 4: Execution time for extracting 

It is very interesting to analyse the compression ability of 
closed interval patterns and generators. For that, we compare 
in each dataset the number of those patterns w.r.t. all 
possible interval patterns. This gives the ratio of closed patterns 
(resp. generators) in the whole search space. In both cases, the 
ratio varies between $10^{-7}$ and $10^{-9}$. This means that the 
volume of useful interval patterns, either closed or generators, is 
very low w.r.t. the set of all possible interval patterns, 
justifying our interest in equivalence classes for interval patterns. 

6 Conclusion 

We discussed the important problem of pattern discovery in 
numerical data with a new and original formalization of 
interval patterns. The classical FCA/itemset-mining settings 
are adapted accordingly: from a closure operator naturally 
arise the notions of equivalence classes, closed and generator 
patterns, and we designed corresponding algorithms. An 
appropriate semantics of interval patterns shows, from both a 
theoretical (redundancy) and a practical (computation times) point 
of view, that mining equivalent binary data (encoding all 
possible intervals) is not acceptable. This is due to the fact that 
interval patterns are provided with a stronger partial ordering 
than IS-itemsets (classical set inclusion), hence pattern 
structures yield significantly fewer generators w.r.t. their semantics. 

Dealing with interval patterns has applications in 
computational geometry, machine learning and data-mining, 
e.g. [Boros et al., 2003] and references therein. It is indeed 
highly related to the problem of (maximal) $k$-boxes, 
which correspond to interval patterns (generators) with 
support $k$. When $k = 0$, it corresponds to the largest empty 
subspaces of the data. Our contribution to this field is the 
characterization of smaller subsets (closed and generators). 

In data-mining, closed patterns and their generators are 
crucial for extracting valid and informative association 
rules [Bastide et al., 2000], while generators can be 
preferable to closed patterns following the minimum description 
length principle for so-called itemset-based 
classifiers [Li et al., 2006]. How these notions can be shifted to 
interval patterns is an original perspective of research raising 
questions concerning missing values, fault-tolerant patterns, 
and interestingness measures that are critical issues even in 
classical itemset mining: although the compression ability 
of closed interval patterns and generators is spectacular, the 
number of patterns remains too high for large datasets. 
However, bringing the problem of numerical pattern mining into 
well-known settings is in favor of these research perspectives. 